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Abstract 



f-H We derive a divergence formula for a group of regularization methods with 

>-^ an I2 constraint. The formula is useful for regularization parameter selection, 

because it provides an unbiased estimate for the number of degrees of freedom. 
1— —1 We begin with deriving the formula for smoothing splines and then extend it 

to other settings such as penalized splines, ridge regression, and functional 
& Unear recession. 
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1 Introduction 

A variety of regularization methods have been proposed in modern statistics (Hastie et al, 
2009). Usually, the regularization is controlled by a tuning parameter, and it is crucial to 
select an appropriate value of the tuning parameter. Many criteria have been discussed 
for tuning parameter selection (Hastie et al, 2009). Some of these criteria, including AIC 
(Akaike, 1973), GCV (Graven and Wahba, 1979), and BIC (Schwarz, 1978), depend on 
estimating the number of degrees of freedom, which measures the model complexity. 

Here we consider the problem of estimating the number of degrees of freedom for a group 
of regularization methods, the ones with an \<i constraints. We derive a divergence formula, 
which provides an unbiased estimate for the number of degrees of freedom (Stein, 1981; Ye, 
1998; Efron, 2004). To understand our goal, we take smoothing slines as an example. 

Smoothing splines are a popular approach to nonparametric function estimation (Wahba, 
1990). Given observations 

Vi = f(xi) +e h i = l,-- ,n, (1) 

where x% G [0, 1] and e, ~ N(Q,a 2 ), one is to estimate f(x). Assume that f(x) is smooth in 
the sense that its second derivative exists and is small. The smoothing splines approach to 
the estimation of f(x) is through minimizing 

y>-/(z0) 2 + A/ {f {2) (x)} 2 dx, (2) 

where the penalty parameter A controls the tradeoff between the lack of fit and the roughness 
of the function estimation. An alternative derivation is through minimizing 

f> - fix,)) 2 s.t. f\fW(x)} 2 dx<p, (3) 

where p is called the constraint parameter. The solution to ([3]) usually falls on the boundary 
of the constraint, and by the Lagrange method, these two formulations are equivalent up to 
the choice of A and p. 



Gu (1998) pointed out that the mapping from p to A is one-to-one but changes with 
the least squares functional, YH=i(Vi ~ f( x i)) 2 - I n the literature, it is well-known that the 
divergence in terms of A is equal to the trace of the "hat" matrix (e.g., Hastie and Tibshirani, 
1990). However, the divergence formula in terms of p is still not available. Because of the 
importance of tuning parameter selection, it is worth deriving this formula. 

The remaining of the manuscript is organized as follows. In Section 2, the divergence 
formula in terms of p is derived for smoothing splines. In Section 3, the result is extended 
to some other settings. In Section 4, the two divergence formulas for smoothing splines, in 
terms of A and p respectively, are compared through a simulation study. Some discussion is 
in Section 5 and technical proof is in Appendix. 

2 Main Result 

2.1 Definition 

We start with reviewing the definition of degrees of freedom. In ordinary linear regression, 
the degrees of freedom simply count the number of parameters. There are some generaliza- 
tions of the degrees of freedom, including Stein's unbiased risk estimate in Stein (1981), the 
generalized degrees of freedom in Ye (1998), and the covariance penalty in Efron (2004). 

Because the definitions are similar for both A and p, we use 9 for the smoothing parameter, 
which could be either A or p. Let fe(-) be the solution to either (I2J) or (|3J) for a given 9, and 
then pi(9) = fe(xi) estimates pi = f(xi). Assume that y° is a new response generated 
from the same mechanism that generates yi. By arguments in Efron (2004), we obtain a 
decomposition of the prediction error, 

n n 

E{J2(V* ~ MO)?} = tf£> - MO)?} + 2E{div(9)}a 2 , (4) 

i=\ i=l 



where 

n 

div(0) = £)0fc(0)/% (5) 

is called the divergence in terms of 9 (either A or p) in Kato (2009), and its expectation 

DF(0) = E{dw{9)} (6) 

is defined as the degrees of freedom in terms of 9 (either A or p). Clearly the divergence is an 
unbiased estimate of the degrees of freedom. 

2.2 The divergence in terms of A 

Since the solution to (pi) is natural splines (Wahba, 1990), it can be written as 

d 

f{x) = Y J N j {x)f) j , (7) 

i=i 

where d = n in smoothing splines approach (d could be different from n in other settings), 
and {Nj(x)} is an n-dimensional set of basis functions for representing this family of cubic 
natural splines. Then the regularization problem ^2h becomes 

argminlly-N^III + A^'fiNA (8) 

where y = ( Vl , ■ ■ ■ ,y n )'. (3 = (ft, • • • ,&)', {N}^ = Njfa) and {n N } tj = J N^\t)Nf\t)dt. 
The "hat" matrix H(A) = N(N'N + Afljv) - N'y, and the divergence in terms of A is 
well-known to be equal to the trace of it (Hastie and Tibshirani, 1990), that is, 

div(A) = trace{H(A)}. (9) 

2.3 The divergence in terms of p 

To develop the divergence in terms of p, we rely on the Demmler-Reinsch algorithm used in 
Eubank (1988). The algorithm can also speed up the computation of div(A). 



Demmler- Reins ch algorithm. Let B be a d x d matrix satisfying B 1 (B l )' = N'N, 
where N is n x d. Let U be orthogonal and C be diagonal such that UCU' = BfijyB'. 
Define Z = N(B'U) and 7(A) = U / (B- 1 )'3(A), where 3(A) is the solution to 0. Then, 



N/3(A) = Z 7 (A), (I + AC)9(A) = Z'y, and trace{H(A)} = £$ =1 (1 + A<y) _1 , where r is the 
rank of C, and c\ > C2 > • • • > c r are the non-zero diagonal elements of C. 
Up to the choice of A and p, the problem ([8]) is equivalent to 

argmin ||y-N/3||| s.t. j3'n n (3 < p. (10) 



Because /3'fijv/3 = j'Cj, the problem (10) becomes 



argmin ||y — Z7H2 s.t. m f'C"f < p. (11) 



Let 7(p) = (71, • • • ,7d)' be the solution to the problem (11) with constraint p and 7 be 
the solution to the problem without constraint. By some tedious arguments in Appendix, we 
derive the following divergence formula in terms of p. 



Theorem Following the notation in the description of Demmler- Reinsch algorithm, 

r-l 

~+ 



3=1 

where r = ||7 — "T(/o)| I2 and for j = !,-•• , r — 1 



r-l 1 

di v(p) = (d-r) + Y, T --., (12) 



„2^2 1 „2 Cp2 



E[=i7fcf c Pj +c j+iV+i 



3 Some extensions 

The formula in the theorem can be extended to many other settings; for example, penalized 
splines, ridge regression, and functional linear regression. For these settings, div(A) is equal 
to the trace of the corresponding hat matrix. 



3.1 Penalized splines 

Penalized splines approach was proposed by Eilers and Marx (1996) to estimate f(x) in (TH). 
For fixed order p and knots K\ < ■ ■ ■ < Kk, penalized splines approach finds a function of 
form f(x) = P + Yfj=i &j xJ + Z)fc=i Pp+k{x - K k ) p + that minimizes 

n K n K 

Y J {y l -f{x l )f + \Y,Pl +k or 5> 4 -/(z,)) 2 s.t. Y.fi+k<P- ( 13 ) 

%=1 k=l 1=1 fc=l 

An advantage of penalized splines over smoothing splines is that K is much smaller than n. 
Ruppert (2002) claimed that the choice of K is not important as long as it is large enough. 
To apply the theorem to derive div(p) in penalized splines, let N be an n x (p + K) 



matrix with the ith row being (l,Xi,--- ,x?,(xi — «i):L ••• , (x« — Kk)+), and let fi 



v 



diag{O p+ i, 1k}j a (p + -^Q x (p + -K") matrix. 

3.2 Ridge regression 

Ridge regression was proposed by Hoerl and Kennard (1970). Given data {(xj,yj) G MP x 
R, i = 1, • ■ ■ , n}, the ridge regression finds a j3 = (/3\, • • • , /3 P )' that minimizes 

n p n p 

Y,(Vi - A) - /3'xi) 2 + A £ $ or £( w - ft - /3'x*) 2 s.t. £ /3 2 < p. (14) 

4 = 1 j = l 1=1 J = l 

To apply the theorem to derive div(p) in ridge regression, let N be an n x (p + 1) matrix 
with the ith row being (1, x%, ■ ■ ■ , x p ), and let On = diag{0, l p }, a (p + 1) x (p + 1) matrix. 

3.3 Functional linear regression 

Given data {(xj(-), y«) G ^([0, 1]) x R,i = 1, • • • , n}, consider functional linear regression, 

yi = a + Xi(t)(3 (t)dt + ei, 
Jo 

where /3o is assumed to be in a Sobolev space of order 2, W|([0, 1]). To estimate fo[x] = 
a + /n x (t)Po(t)dt, find one functional which minimizes 

y>-/N) 2 + A/ [/3( 2 )(t)] 2 (it or ^(^-/[x,]) 2 s.t. / [pW(t)} 2 dt<p, (15) 
i=l ^ i=l J ° 



among {/ : £ 2 ([0, 1]) -»• K | f[x] =a + f* x(t)/3(t)dt : a E M, G W|([0, 1])}. 

For the functional linear regression, Yuan and Cai (2010) developed a reproducing kernel 



HUbert space (RKHS) approach. They showed that the solution to (15) can be written as 



(3(t) = di + d 2 t + s rc i \ \xi(t) - x(s)]K(t, s)ds, 



where x(s) = Y2 x i( s )/ n an d K(t,s) is a kernel function. Let S be an n x n matrix where 
{E}ij = J J[xi(t) — x(s)]K(t, s)[xj(t) — x(s)]dsdt, T an n x 2 matrix where {T}jj = J[xi(t) — 



x(t)]t J dt, d = (di, ^2)', and c = (ci, • • • , c n )' . Then problem (15) becomes 



||y- (Td + 5]c)||^ + Ac / Ec or ||y - (Td + Sc)||| s.t. c'Sc < p. 

To apply the theorem to derive div(p) in functional linear regression, let the QR decom- 
position of T be (Qi : Q 2 )(R' : 0)' where Qi is n x 2, Q 2 is n x (n - 2), Q = (Qi : Q 2 ) 
is orthogonal and R is upper triangular, with T/Q 2 = 0. Since T'c = 0, c must be in the 
column space of Q 2 , giving c = Q 2 r/ for some r\ an n — 2 vector. Therefore, to apply the 
theorem, let N be replaced by n x n matrix (SQ 2 : T) and let CIn be replaced bynxn 
matrix diag(0, S). 

4 Simulation studies 

In this section, we conduct a simulation study to verify the divergence formula in terms of p 
for smoothing splines. We adopt the simulation setting in Gu (1998). On X{ = (i — 0.5)/100, 
% = 1, • • • , 100, we generated 100 replicates of data from ([!]) with f(x) = 1 + 3sin(27rx — %) 
and a 2 = 1. As in Gu (1998), for A on a fine grid of log 10 nA = (— 5)(0.05)(— 1), we calculated 
the solution to &2n, and then determined retrospectively the corresponding p = L p 2 '{x)dx. 
This implies that the connection between A and p is replicate-specific. 

First, we compare the convergence formulae in terms of A and p respectively. For each 
replicate, at each A in the grid, f(x) was calculated, the corresponding p is calculated, and 



then div(A) and div(p) are calculated through ((91) and (12) respectively. For the first 10 



replicates, the divergences are summarized in Figure 1. In the left panel, the curves of div(A) 
against A (they are identical) are drawn in red and the curves of div(p) against A are drawn 
in blue. In the right panel, the curves of div(A) against p are drawn in red and the curves of 
div(p) against p are drawn in blue. 

Figure 1: Divergence Formula (blue for p and red for A) 
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Further, as an application, the divergence formulas of A and p can be used in construction 
of smoothing parameter selection criteria such as GCV and AIC. Again, because the criteria 
are commonly defined for both A and p, we use notation 9, which could be either A or p. Let 
RSS(6>) = Y^i=x\Vi ~ fii(®)] 2 be the residual sum of squres. An effective smoothing parameter 
selection criterion is Akaike Information Criterion (AIC; Akaike (1973)), 



AIC(0) = logRSS(0) + 2div(0). 



(16) 



Another effective smoothing parameter selection criterion is Generalized Cross- Validation 
(GCV; Craven and Wahba (1979)), 

RSS(0) 



GCV(#) 



(n-div(0)) 2 ' 

8 



(17) 



We compare the performances of AIC and GCV in terms of A and p. Following Caution 1 
in Gu (1998), we consider the risk function indexed in terms of p, Risk(p) = E- Ym=i if( x i) ~ 
f(xi)) 2 , where the expectation is with respect to t%. By Caution 1, the risk function indexed 
in terms of A is meaningless because model index A is data-specific. The comparison is based 
on the following relative error in Hastie et al. (2009, p. 241), 



100 x 



Risk(p) — min p Risk(p) 



(18) 



maxp Risk(/9) — min p Risk(p) 

We should explain p in the above formula. If AIC(p) is applied, p = argminp AIC(p). If 
AIC(A) is applied, p is defined as the counterpart of A = argmin^ AIC(A). Similar p is 
defined for GCV. The results are summarized in Figure 2. It is found that the performances 
of criteria in terms of p are almost the same as those in terms of A. This finding supports 
the correctness of the derived formula. 

Figure 2: Relative Error 
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5 Discussion 

For regularization methods with an l\ penalty, the divergence formula (in terms of A, if 
adopt our notation) was derived in Zou et al. (2007). For regularization methods with an l\ 
constraint, the divergence formula (in terms of p, if adopt our notation) was derived in Kato 
(2009). This manuscript considers the divergence formula for the I2 regularization, which 
appears ahead of the l\ regularization. 

Although the divergence formula (in terms of A) for regularization methods with an I2 
penalty has been existing long ago, the divergence formula (in terms of p) for regularization 
methods with an I2 is still not available in the literature. Now this missing formula is derived. 

Appendix 

In this appendix, following Kato (2009), we develop the divergence formula in terms of 
p. On the boundary S7 = {7 € R d : 2~^=i c j7? = P}> 1 can De transformed into polar 
coordinates: 7 = (y/~pu(9i, • • • , 9 r -±), 7r+i, ■ ■ ■ , Id)' ■> with u(9j, • • • , r -i) defined as 
cosOj sin 0j cos 0j+i sin#j sin0j+i ■ ■ ■ cos0 r _i sinflj sin#j+i ■ ■ ■ sm9 r -\ ., 

where < 9j < 7r, j = 1, ■ ■ • ,r — 2, and < 9 r -\ < 2ir. On the boundary, which is in a 
(d — l)-dim smooth minifold, the partial derivative of 7 with respect to 9j is given by 
37 



/ \> 



— - y /psm6i---sm0 j - 1 (O , j _ 1 ,v(9 j ,--- , 6» r _i), d _ r) , 

with Oi defined as /-dim vector of all components being zero, and v(9j, ■ ■ ■ , 9 r -\) as 

sin 9j cos 9j cos dj+i cos 9j sin 0j+i ■ ■ ■ cos 9 r -\ cos 9j sin 9j+i ■ ■ ■ sin # r _i , 



Furthermore, on the boundary, the second partial derivatives, for 1 < j < k < r, are 
_ = - y /psm9i---sm9 j -i(0' j _ 1 ,u(9 j ,--- , 9 r -i) , 0' d _ r )' , 

y/psm9i ■ ■ -sin^j^i cos^j sin^j+i • • • sin9k-i(0 / k _ 1 ,u(9k, • • • , 9 r -i),0 d _ r )'. 



d9jd9 k 

10 



Define as v the following vector which is orthogonal to the tangent space of Q at -f, 

(\/ci cos^i, i/c^sinOi cos #2, ■ ■ ■ , \JCr-\ sin #1 sin 6*2 • • • cos# r _i, y^sin^i • • • sin0 r _i, 0' d _ r )' , 

and i/ = u/\\u\\ 2 . 

We are ready to calculate the first fundamental form and second fundament form defined 
in Kato (2009, p.1342-1343). For this aim, let = (0i,--- A-i)' and u = (7 r +i,--- ,Jd)'- 
The first fundamental form equals, noting that -^jq^t = Id—r an d iwduy = 0(r— i)x(d-r)> 

G = diag(Gn, Jd-r), 

where Gn = -gg-jM = £'diag~ (ci, • • • ,c r )L, with L being the followng r x (r — 1) matrix, 

1=1 \ v"(ej,--- ,e r -i) ) L=1 \ » u («r-i) 

and v°(6j, • • • , 6 r -\) being the following r — j + 1 vector, 

r-2 r-1 

(— sin0j,cos0j cosOj+i, ■ ■ ■ ,cos0j I[ sin^cos^r-i^cos^j ][ sin^)'. 

l=j+l l=j+l 

The second fundamental form equals, noting that qqq^i = 0(r—i)x(d-r) an d a^dL 1 = ®(d-r)x(d-r)i 

H = diag(i?n, 0(d-r)x(d-r))> 

where H\\ is a (r — 1) x (r — 1) matrix with the (j, k) component being — r < v, qq.qq >■ 
Here < -, ■ > is the ordinary Euclidean inner product in R . It can be verified that H\\ = 
rdiag(/ii,--- ,h r -i) = J^J JL, where hj = y/psm 2 Q x ■ ■ ■ sin 2 ^_i/||i/|| 2 . 

By Lemma 3.2 in Kato (2009), we can obtain the divergence formula in terms of p, 

div(p) = (d-r) + X: i -^- > 
j=\ 

where <j)j , j = 1 , ■ ■ ■ , r — 1 , are the eigenvalues satisfying the equation 

det(#n - 0Gn) = 0. 
11 



To find the eigenvalues, note that H\\ — <pG\\ = L'dmg{d\, • • • , d r )L, where dj = 7=rnji f:- 

It can be verified the jth diagonal component of L'diag(di, • • • , d r )L equals 

i-2 

e j(4>) = IT s i n Ql(dj~i si n Qj-\ + dj cos 0j~i/ cos Oj), 

1=1 

for j = 1, • • • , r — 1. Let ^ be the solution to the equation ej{4>) = 0, j = 1, • ■ ■ , r — 1. Since 
matrix L'diag(di, • • • , d r )L is non-negative definite, we can conclude that (f>j, j = 1, ■ • • , r— 1, 
are the eigenvalues we need. It is easy to see that <f)j is the solution to ej((j)) = 0, and therefore 
the divergence formula in terms of p is obtained. □ 
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